Corpus for Benchmarking Clinical Speech De-identification
To address the scarcity of publicly available resources for clinical speech de-identification, this paper introduces the SREDH-AICup corpus, a time-aligned dataset of 20 hours of English and Mandarin audio. The corpus is annotated with 7,830 sensitive health information entities across 38 categories and is intended to support research on automated privacy protection.